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Statistical models for describing the probability distribution over the states of biological systems are 
commonly used for dimensional reduction. Among these models, pairwise models are very attractive in 
part because they can be fit using a reasonable amount of data: knowledge of the means and correlations 
between pairs of elements in the system is sufficient. Not surprisingly, then, using pairwise models for 
studying neural data has been the focus of many studies in recent years. In this paper, we describe how 
tools from statistical physics can be employed for studying and using pairwise models. We build on our 
previous work on the subject and study the relation between different methods for fitting these models and 
evaluating their quality. In particular, using data from simulated cortical networks we study how the quality 
of various approximate methods for inferring the parameters in a pairwise model depends on the time bin 
chosen for binning the data. We also study the effect of the size of the time bin on the model quality itself, 
again using simulated data. We show that using finer time bins increases the quality of the pairwise model. 
We offer new ways of deriving the expressions reported in our previous work for assessing the quality of 
pairwise models. 
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1 Introduction 



In biological networks the collective dynamics of thousands to millions of interacting elements, generating 
complicated spatiotemporal structures, is fundamental for the function. Until recently, our understanding 
of these structures was severely limited by technical difficulties for simultaneous measurements from a large 
number of elements. Recent technical developments, however, are making it more and more common that 
experimentalists record data from larger and larger parts of the system. A living example of this story is 
what is now happening in Neuroscience. Until a few years ago, neurophysiology meant recording from a 
handful of neurons at a time when studying early stages of sensory processing (e.g. the retina), and only 
single neurons when studying advanced cortical areas (e.g. inferotemporal cortex). This limit is now rapidly 
going away, and people are collecting data from larger and larger populations (see e.g. the contributions in 
[l]). However, such data will not help us understand the system per se. It will only give us large collections 
of numbers, and most probably we will have the same problem that we had with the system, but now with 
the data set: we can only look at small parts of it at a time, without really making use of its collective 
structures. To use such data and make sense of it we need to find systematic ways to build mathematically 
tractable descriptions of the data, descriptions with lower dimensionality than the data itself, but involving 
relevant dimensions. 

In recent years, binary pairwise models have attracted a lot of attention as parametric models for studying 
the statistics of spike trains of neuronal populations O [Sj HJ [H El [7] , the statistics of natural images [5, 
and inferring neuronal functional connectivities [3l [9] . These models are the simplest parametric models 
that can be used to describe correlated neuronal firing. To build them one only needs the knowledge of the 
distribution of spike probability over pairs of neurons, and, therefore, these models can be fit using reasonable 
amounts of data. As is the case for any parametric model, one would like to know how useful a pairwise 
model is, that is, how well it can describe the statistics of spike trains and whether specific questions about 
the network can be studied using the fitted model and its parameters. Furthermore, one would like to have 
fast and reliable ways to fit the model. These issues can be naturally studied in the framework provided 
by statistical physics. Starting from a probability distribution over the states of a number of elements, 
statistical physics deals with computing quantities such as correlation functions, entropy, free energy, energy 
minima, etc., through exact or carefully developed approximate methods. It allows one to give quantitative 
answers to questions such as: what is the entropy difference between real data and the parametric model? 
Is there a closed-form equation relating model parameters to the correlation functions? How well does a 
pairwise model approximate higher order statistics of the data? 

In this paper, after briefly reviewing the experimental results on pairwise models, we discuss the statistical 
physics approaches that are useful for studying these models and what we find by employing them. We 
do this with the aim of providing a coherent framework for fitting and using pairwise models as well as 
evaluating their quality. We describe a number of approximate methods for fitting pairwise models, their 
underlying assumptions, their relation to each other, and the physical intuition behind them. We also study 
the quality of pairwise models to model the probability distribution over the spike trains. We have studied 
these issues in our previous work [5l[6], and, here, we expand on this work in three ways. We study how the 
quality of approximate inference methods described in [6] and the quality of the pairwise models depend 
on the bin size chosen for binning the data. We describe new ways of deriving the perturbative expansion 
we reported in [5] as well as of one approximation that we find to perform well (the TAP approximation). 
We also study how the probability of synchronous spikes according to the independent and pairwise models 
compare with the true probabilities. 

In the first part of the paper, we describe the naive mean-field approximation and the independent-pair 
approximation for fitting the model parameters. When applied to binned spike trains, these approximations 
provide reliable estimates of the model parameters when the size of the system and/or the time bins are 
small. We then describe how these approximations can be corrected to provide very accurate approximations 
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for estimating the parameter even for large populations and large time bins: the Thouless- Anderson-Palmer 
approximation and the Sessak-Monasson approximation. 

We then go on to study the quality of the pairwise model. As briefly reviewed in sec. [21 several experi- 
mental reports showed that the pairwise models can provide perfect approximations to the true probability 
distribution of small populations of cells. Following this experimental work, Roudi et al [5] performed a the- 
oretical analysis of the pairwise models to understand if their success on small populations can be extended 
to real-sized system. To do this, they derived equations for the entropy difference between the pairwise 
model and the data, as well as between an independent-neuron model and the data. This was done in 
the regime where the average number of spikes generated in a time bin by the whole population is small 
compared to one, i.e. when N6 <^ 1 where 



where N is the number of neurons in the populations, Ui is the firing rate of neuron i and 6t is the size 
of the time bins. We denote this regime "the low-rate regime" (also called "the perturbative regime" [5]), 
however, we emphasize that it is really the quantity N5 that defines this regime, and not only the firing rate. 
In [5], the authors showed that in the low-rate regime, most of the difference between the true entropy and 
the entropy of the independent-neuron model can be explained by the pairwise model. However, this fact 
happens regardless of whether the true distribution for large N can be well approximated by the pairwise 
model. In other words, observing a good pairwise model in the low-rate regime does not tell us if the pairwise 
model will be a good model for the real sized system. 

A crucial step in the derivation of entropy differences in was to express the pairwise distribution 
in the so called Sarmanov-Lancaster representation [10\ Here we show that one can derive the same 
expressions by approximating the partition function of a Gibbs distribution (Eq. ([3]) in sec. II. ip . In addition 
to recovering the low-rate expansion of the entropies, we use the results to find the difference between the 
probability of synchronous spikes according to the true model, the pairwise model and the independent 
model. We also show that one can derive the results of [5] by extending the idea behind the independent- 
pair approximation to triplets of neuron, i.e. the independent-triplet approximation. By taking the limit 
ViSt — > of the triplet approximation to the entropy of a given distribution, we show that the results of [5] 
can be recovered in a considerably simpler way. 

An important step in using binary pairwise models is making a binary representation of the spike trains. 
This is done by binning the spike trains into small time bins and assigning zero or one to each bin depending 
on whether there is a spike in it or not. In fact, it was predicted in [5] that using finer and finer time bins, 
improves the quality of the pairwise model. In this work we study the infiuence of the bin size on the quality 
of the approximate methods of fitting as well as the quality of the model. 

Finally, we discuss two possible extensions of the binary pairwise models and mention a number of 
important questions that they raise. We first describe the extensions to non-binary variables useful for 
studying the statistics of modular models of cortical networks. We then describe an extension of the 
pairwise model to a model with asymmetric connections which gives promising results for discovering the 
synaptic connectivity from neural spike trains. 

1.1 The binary pairwise model 

In a binary pairwise model, starting from the spikes recorded from N neurons, one first divides the spike 
trains into small time bins. One then builds a binary representation of the spike trains by assigining a binary 
spin variable Si{t) to each neuron i and each time bin t, with Si{t) = — 1 if neuron i has not emitted any 
spikes in that time bin, and Si{t) = 1 if it has emitted one spike or more. From this binary representation. 
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the means and correlations between the neurons are computed as 

{Si)da.ta. = j;^Si{t) (2a) 
t 



where T is the total number of time bins. 

The binary pairwise model of the data is then built by considering the following distribution over a set 
of N binary variable s = (si, S2 • ■ ■ > •stv) 



Ppair(s) = ^exp 



^ ^ hi Si + ^ ^ JijSiSj 
i i<j 



(3) 



and chosing the parameters, hi and Jij such that the means and pairwise correlations under this distribution 
matches those of the data defined in Eq. that is 

(•5i)pair = ^^J'pair(s) Si = (sj)data (4a) 
s 

(SjSj)pair = ^ '] Ppair (s) SiSj = {SiSj)da,ta,- {^^) 
s 

Following statistical physics terminology, the parameters hi are usually called the external fields and Jij, 
the pairwise couplings or pairwise interactions. One important property of the pairwise distributions in 
Eq. ([3]) is that it has the maximum amount of entropy among all the distribution that have the same mean 
and pairwise correlations as the data, and it is thus usually called the maximum entropy pairwise model. 
It is also called the Ising model, in accordance with the Ising model introduced in statistical physics as a 
simple model of magnetic materials. 

Although typically written in terms of ±1 spin variables, it is sometimes useful to write the pairwise 
distribution of Eq. ([3]) in terms of boolean variables rj = (si + l)/2. In the rest of the paper, we sometimes 
use such boolean representation as some of the calculations and equations become considerably simpler in 
this representation. We denote the external fields and the couplings in the boolean representations by Tii 
and jJij. 

The external fields and pairwise couplings can be found using both exact and approximate methods as 
discussed in details in [6] and reviewed briefly in the following sections. Before reviewing these methods, 
we first review the experimental studies that have used the maximum entropy pairwise model to study the 
statistics of spike trains. 



2 Review of experimental results 

The maximum entropy pairwise model as a model for describing the statistics of neural firing patterns was 
introduced in [2] and [3]. This work used the Ising model to study the response of retinal ganglion cells 
to natural movies [2], steady spatially uniform illumination and white noise [3], as well as the spontaneous 
activity of cultured cortical networks [2] . The main goal of these studies was to find out how close the fitted 
pairwise model is to the true experimentally computed distribution over s. 

As a measure of distance, these studies compared the entropy difference between the true distribution 
and the pairwise model and compared it with the entropy difference between the the true distribution and 
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an independent model. The independent model is a distribution that has the same mean as the data but 
assumes the firing of each neuron is independent from the rest, i.e. 

1 r 1 exp rV /i™^s l 

^'^'^'^ = ^ n [(-^)data<5..,l + (1 - (..)data)5s.-lj = ^ COsh(/.^°V ^^^^^ 

hf"^ = tanh~\{si)) (5b) 
The measure of misfit can thus be defined as 



^ _ 'S'pair 'S'true /gs 
'-'ind '-'true 

where the overline indicates averaging with respect to many samples of neurons. The results of Schneidman 
et al showed that for populations of size = 10 or so, A was around 0.1. In order words, in terms of 
entropy difference, the pairwise model offered a ten fold improvement over the independent model. In 
the other study, Shlens et al [3] found A ~ 0.01 for N = 7. These authors also considered a slightly 
different model in which only the pairwise correlation between the adjacent cells was used in the fit, and 
correspondingly the pairwise interactions between non-adjacent cells were set to zero. The results showed 
that this adjacent pairwise model also performed very well, with A ~ 0.02 on average. It is important to 
note that in Schneidman et al the stimulus induced long range correlations between cells, while in the data 
studied Shlens et al the correlations extended only to nearby cells. Following these studies, Tang et al [3j 
reproduced these observations to a large extent in other cortical preparations and also concluded that the 
pairwise model can successfully approximate the multi-neuron firing patterns. In a very recent study [7], 
Shlens et al extended their previous analysis to population of up to 100 neurons concluding that the adjacent 
pairwise model performs very well even in this case, with A ~ 0.01 — 0.02. The studies of Shlens et al were 
done without stimulation or with white noise stimulus, situations in which neurons do not exhibit strong 
long range correlations. It is still unclear how the pairwise model (not necessarily adjacent) will perform for 
cases in which neuronal correlations exist between neurons separated by large distance, i.e., when stimulated 
by natural scenes. 

The assumption that A is a good measure of distance rests upon the assumption that we are inter- 
ested in finding out how different the true and model distributions are over the whole space of possible 
spike patterns. This can be appreciated when we note that the definition in Eq. ([6]) is equivalent to 
A = DKLiptrne\\Ppa.ir)/DKL{ptrne\\Pmd), where DKL{p\\q) is the Kullback-Leibler divergence between two 
distribution p(s) and g(s) defined as DKiipWQ) = X]sP(^) Using this observation, we can 

think of A as a weighted sum of the difference between the log probability of states according to the true 
distribution and according to the model distribution normalized by the distance to the independent model. 
Of course, we may not be interested in finding how different the two distribution are over the space of all 
possible states, but only how differently the model and true distributions assign probabilities to a subset 
of important states. For instance, in a particular setting, it may be important only to build a good model 
for the probability of all states in which some large number of neurons M fire simultaneously (i.e., when 
^jTj = M), regardless of how different the two distribution are on the rest of the states. It was found in 
[12j that the pairwise model offers a significant improvement over the independent model in modelling the 
experimental probability of synchronous multi-neuron firing of up to A'^ = 15 neurons. 



3 Approximations for fitting binary pairwise models 

The commonly used Boltzmann Learning algorithm for fitting the parameters of the Ising model is a very 
slow process, particularly for large A^. Although effort has been made to speed up the Boltzmann learning 
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algorithm [T3], such modified Boltzmann algorithms still require many gradient descent steps and long 
Monte Carlo runs for each step. This fact motivates the development of fast techniques for calculating 
the model parameters which do not rely on gradient descent or Monte Carlo sampling. A number of such 
approximations have been studied in [6]. In what follows we describe these approximations, and in particular 
the relation between the simpler approximations (Naive mean-field approximation and the independent-pair 
approximation) to the more advanced ones (TAP approximation and Sessak-Monasson approximation). 



3.1 Naive mean-field approximation (nMF) and the independent-pair approximations 
(IP) 

The simplest method |141 [T5] for finding the parameters of the Ising models from the data uses mean field 
theory: 



tanh ^ nii = hi + Jj. 



(7) 



where mj = (si)data- These equations express the effective field that determines the magnetization rrii as 
the external field plus the sum of the infiuences of other spins through their average values rrij , weighted by 
the couplings Jij. Differentiating with respect to a magnetization rrij gives the inverse susceptibility (i.e., 
inverse correlation) matrix 



(C' 



-J, 



(8) 

for i ^ where Cij = (sjSj)data — miiTij. Thus, if one knows the means nii and correlations Cij of the 
data, one can use Eq. ([8]) to find the Jij and then solve Eq. ([7|) to find the hi. We call this approximation 
"naive" mean-field theory (abbreviated nMFT) to distinguish it from the TAP approximation described 
below, which is also a mean-field theory. 

In the independent-pair (IP) approximation, one solves the two-spin problem for neurons i and j, ignoring 
the rest of the network. This yields the following expressions for the parameters in terms of the means and 
correlations: 



IP 



/i| = log 



((1 + mi){l + ruj) + Cij)){{l - mi){l - rrij) + dj) 
((1 - mi){l + ruj) + C,j))((l + mi){l - mj) + dj) 
{l + mi){l-mj) -Q 



{I - mi){l - rrij) + Q 



+ J, 



(9a) 
(9b) 



where /i] is the external field acting on i when it is considered in a pair with j. It has been noted in [6] that 
in the limit rrii ^ —1 and mj — > — 1, jjj matches the leading order of the low-rate expansion derived in [H]. 

Although the couplings found in the independent-pair approximation can be directly used as an approx- 
imation to the true values of Jij, relating the fields /i^ found from the independent-pair approximation to 
those of the model is slightly tricky. The reason is that the expression we find depends on which j we took to 
pair with neuron i. It is natural to think that we can sum over j, i. e., over all possible pairings of cell i, 
to find the Ising model parameter hi. In doing so, however, we should be careful. The expression in Eq. ()9bp 
has two types of terms, those that only depend on i, i.e. the first terms in the following decomposition, and 
those that involve j, i.e. the second and third terms below 



h^ 



1 



log 



l + rrii 
1 



+ ^log 



1 - rrij - Cij /{I + rrii) 
1 - irij + Cij /{I - rrii) 



+ J, 



(10) 



The first terms is the field that would have been acting on i if it were not connected to any other neuron, and 
the rest are contributions from interactions with j. By simply summing /i] over j, we will be overcounting 
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this term, once for each pairing. In other words, the correct independent-pair approximation for hi will be 




1 + TTij 



I - mi 




l-nij- Cij/{1 + rui) 
l-mj + Cij/{1 - nii) 



(11) 



Although simple in its derivation and intuition, in the limit rrii ^ —1 for all i, Eq. (jlip recovers both the 
leading term and the first order corrections of the low-rate expansion, as shown in Appendix [Bj 

The simple naive mean-field and independent-pair approximations have been shown to perform well in 
deriving the parameters of the Ising model when the population size is small. 

In [6], we showed that for data binned at 10 ms, the naive mean-field and the IP approximations perform 
well in deriving the parameters of the Ising model when the population size is small. In Fig. [T] we extend 
this study and evaluate how the quality of these approximations depend on the size of the time bin, 6t. 
In this figure, we plot the value between the couplings found from nMFT and IP approximations and 
the results of long Boltzmann runs as a function of the time bin chosen to bin the data. We do this for 
both iV = 40 and = 100. The simulations used to generate the spike trains and the Boltzmann learning 
procedure used in this figure are the same as those reported in [6], with two exceptions. The first one is 
that here we use more gradient descent steps and longer Monte Carlo runs, namely, 60000 gradient descent 
steps and 40000 Monte Carlo steps per gradient descent step. The second one is that here we use 10000 
seconds worth of data for estimating the means and correlations. This is 2.5 times larger than what we 
used before. Both of these improvements were made to ensure reliable estimates of the parameters, as well 
as the means and correlations, particularly for fine time bins. As can be seen in Fig. [H increasing either A^ 
or 5t results in a decay in the quality of the IP approximation, as well as its low-rate limit. For the case of 
nMFT, a reasonable performance is observed only for 6t = 2ms and A^ = 40. For large populations sizes 
and/or time bins nMFT is a bad approximation. Given the strong dependence of the quality of these simple 
approximations on population size and the size of the time bin, we describe below how one can extend 
these approximations to obtain more accurate expressions for finding the external fields and couplings of 
the pairwise model for large A^ and 6t. 

3.2 Extending the independent-pair approximation 

Extending the independent-pair approximation is in principle straightforward. Instead of solving the prob- 
lem of two isolated spins, we can solve the problem of three spins, i, j, and k as shown in Appendix [Cl 
This will lead to the independent-triplet (IT) approximation. In Fig. [U we show the quality of the IT 
approximation for finding the couplings as compared to the Boltzmann solutions. For A^ = 40, we see that 
the IT approximation provides an improvement over IP for different values of 6t. In Fig. [H we also looked 
at the quality of the low rate limit of the IT approximation. In Appendix Owe show that, in the same way 
that the low rate limit of the IP approximation gives us the leading order terms of the low-rate expansion in 
[5], the low rate limit of the IT approximation gives us the first order corrections to it. For A^ = 40, this is 
evident in the fact that the low-rate limit of IT outperforms the low-rate limit of IP for 6t < 15 ms. When 
the population size is large, however, IT and its low rate limit outperform IP for only very fine time bins. 
Even for 5t = 4: ms, IP and IT and their low rate limits perform very bad. 

One can of course build on the idea of the IP and IT approximations and consider n=4, 5, ... spins. 
However, for any large value of n, this will be impractical and computationally expensive, for the following 
reason. As described in Appendix [O for the case of the independent-triplet approximation, there are two 
steps in building an independent-n-spin approximation. The first one is to express the probability of each of 
the 2" possible states of a set of n spins in terms of the means and correlations of these spins. This requires 
inverting the 2" x 2" matrix. The second step, which is only present for n > 3, is to express all correlations 
functions in terms of the means and pairwise correlations. Both of these steps become exponentially hard 
as n grows. Nonetheless, as shown in Appendix [Cj even going to the triplet level can be a very useful 
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Figure 1: The quality of various approx- 
imations for different time bins and pop- 
ulations sizes. Here we plot the B? val- 
ues between the couplings obtained using 
various approximate methods and those 
found from Boltzmann learning versus 5t 
(a) = 40 and (b) N = 100. In both 
panel the colour code is as follows. Black, 
SM; Red, TAP; Blue, SM-TAP hybrid; 
Green, nMFT; Cyan, IP; Magenta low- 
rate limit of IP; Cyan with dashed line, 
IT; Megenta dashed lines, low rate limit 
of IT. For both population sizes and all 
time bins, the TAP, SM and hybrid ap- 
proximations perform very well. 



exercise, as it offers a new way of computing the difference between the entropy of the true model and that 
of the independent model (Eq. (jl5ap in sec. 14. Ih as well as the difference between the entropy of the true 
model and that of the pairwise model (Eq. ()15bp 14. This derivation is considerably simpler than the 
original derivation of these equations based on the Sarmanov-Lancaster representation of the probability 
distribution described in [5| as well as the derivations in Appendix [XJ in which one starts by expanding the 
partition function of a Gibbs distribution. Furthermore, the IT approximation also yields a relation between 
the couplings and the means and pairwise correlations that coincides with leading term and the corrections 
found by low-rate expansion, as shown in Appendix [Cl 

3.3 Extending the naive Mean-Field: TAP equations 

The naive mean-field, independent-pair and independent-triplet approximations are good for fitting the 
model parameters when the typical number of spikes generated by the whole population in a time bin is 
small compared to one, i.e., when N5 ^ 1 [5l[6]. For large and/or high firing rate populations, however, such 
approximations perform poorly for inferring the model parameters. It is possible to make simple corrections 
to the naive mean-field approximation such that the resulting approximation performs well even for large 
populations. This is the so called Thouless- Anderson-Palmer (TAP) approximation [16]. The idea dates 
back to Onsager, who added corrections to the naive mean-field approximation, taking into account the 
effect of the magnetization of a spin i on itself via its influence on another spin j. Subsequently, it was 
shown that the resulting expression was exact for infinite-range spin-glass models [16]. The TAP equations 
are 

tanh~^ nii = hi + Jip^j ~ Jiji^^ii^ — m'j). (12) 
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Differentiation with respect to nij tlien gives {i ^ j) 

(C~^)ij = -Jij - 2mimjjfj (13) 

One can solve Eqs. (jl3p for the Jij and, after substituting the result in Eqns. (jl2p . solve Eqns. (|12p for the 
hi [laill]. 

There are several ways to derive this expression for the pairwise distribution Eq. ([3]). In Appendix [Dl 
we show that these equations can also be derived from the celebrated Belief Propagation algorithm used in 
combinatorial optimization theory [T7]. When applied to spike trains from populations of up to 200 neurons, 
the inversion of TAP equations was shown to give remarkably accurate results [6] for fitting the pairwise 
model. In Fig. El we show scatter plots comparing the couplings found by the TAP approximation versus 
the Boltzmann results for N = 100 and 5t = 2, 10, 32 ms. The TAP approximations does well in all cases, 
and this is quantified in Fig. [H In Fig. [1], we demonstrate the power of inverting the TAP equations for 
inferring the couplings for various time bins for both = 40 and N = 100. 



3.4 Sessak-Monasson approximation (SM) 

Most recently, Sessak and Monasson [18] developed a perturbative expansion expressing the fields and 
couplings of the Ising distribution as a series expansion in the pairwise correlations function Cij. Some of 
the terms of the expression they found for the couplings could be summed up. It was noted in [6], that 
one can think of the resulting expression as a combination of the naive mean-field approximation and the 
independent-pair approximation. The Sessak-Monasson result can be written as 

tSM _ _/^-iv , tIP /-..N 

The reason why the last terms should be subtracted is discussed below. 

Let us consider two neurons connected to each other. In the independent-pair approximation we calculate 
the fields and couplings for this pair exactly within the assumption that they do not affect the rest of the 
network and vice versa. If we were to find the coupling between this pair of neurons using the naive mean- 
field approximation Eq. dS]), the result would just be the last term in Eq. (I14p . The reason why we should 
subtract it is now clear: the first term in Eq. (jl4p includes a naive mean-field solution to the pair problem. 
We subtract this part and replace it by the exact solution of the pair problem. Fig. [T] shows that this result 
is very robust to changing 5t, although for N = 100, we can note a small decay in with 5t. The good 
performance of the SM approximation can be also seen in the scatter plots shown for = 100 in Fig. 
[2j These observations support the SM approximation as a very powerful way of inferring the functional 
connections. Following the observation made in [5], in Fig. [1] we also show how a simple averaging of the 
best approximate methods, i.e., TAP inversion and SM can provide a very accurate approximation to the 
couplings across different time bin and population sizes. 



4 Assessment of model quality 

The experimental result that the binary pairwise models provide very good models for the statistics of spike 
trains is very intriguing. However, the message they carry about the architecture and function of the nervous 
system is not clear. This is largely due to the fact that, as reviewed in sec. [21 the experimental studies were 
conducted on populations of small number of neurons (A^ ~ 10) and their implications on the real sized 
system are not trivial. Is it the case that observing a very good pairwise model on a subsystem of a large 
system constrains the structure and function of the real sized network? Does it mean that there is something 
unique about the role of pairwise interactions in the real sized system? Answering this question depends to 
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(a) (d) 




Figure 2: Scatter plots showing the re- 
sults of TAP and SM approximations ver- 
sus the Boltzmann results for various time 
bin sizes 5t and N = 100. Panels (a), 
(b) and (c) show the TAP results versus 
the Boltzmann results for data binned at 
2 ms, 10 ms and 32 ms, respectively, (d), 
(e) and (f ) show the same but for the SM 
approximation. Note that the structure of 
the error in estimating the couplings from 
the TAP equations changes when the size 
of the time bin is increased. 



-10 1 -10 1 

J'ij Jij 




-10 1 -10 1 



a large extent on answering the extrapolation problem: to what degree the experimentally reported success 
of pairwise models holds for the real sized system? In what follows, we discuss some theoretical results that 
bear on this question. 

4.1 Entropy difference 

The extrapolation problem was first addressed in [5] by analyzing the dependence of the misfit measure 
A (defined in Eq. ^) on A^. The authors considered an arbitrary true distribution and computed the 
KL divergence between this distribution and an independent-neuron model as well as between it and the 
pairwise model. This was done using a perturbative expension in N5 <^ 1. The results was the following 
equations: 

Dkl{p\ I Find) = 5ind - Strnc = 9indN{N - 1)6^ + 0{N6^) (15a) 
DkUpW Ppair) = Vir " -^true = ffpairA(A - 1)(A - 2)6^ + 0{{N6^) (15b) 

where g'ind and (^pair are constant that do not depend on N or 5t and are defined in Eqs. ()A-llaP and 
()A-llap . Using these expression for the KL divergences yields 

A = ^{N -2)6. (16) 

9ind 

Eqs. (jl5p and ()16p show that for small N5, A will be very close to 0, independent of the structure of 
the true distribution. In other words, in this regime, a very good pairwise-model fit is a generic property 
and does not tell us anything new about the underlying structure of the true probability distribution. It is 
important to note that the perturbative expansion is always valid if 5t is small enough. That is, simply by 
choosing a sufficiently small time bin, we can push A as close to as we want. 
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Figure 3: The quality of the pairwise and independent models for different time bins and populations sizes, 
(a) DKL{ptrue\\Pmd) versus N, (b) -DxL(ptrue I Impair) versus N and (c) A versus N all for 6t = 10 ms. (d), 
(e), (f) show the same things for 6t = 2 ms. In all panels, the black stars represent quantities as computed 
directly from the simulated data, while the red squares show the predictions of the low-rate expansion, i.e. 
Eqs. (jlSp and (jl6p . We have used 18000 seconds of simulated data for computing the plotted quantities and 
have corrected for finite sampling bias as described in |6j. 



In [5], 5pair and gi^d are related to the parameters of the pairwise model up to corrections of 0{N6). In 
Appendix[Al we present a different derivation by expanding the partition function of a true Gibbs probability 
distribution around the partition function of a distribution without couplings. In the following subsection, 
we also use the results of this derivation to compare the probability of synchronous spikes under the model 
and the true distributions. Furthermore, in Appendix Oi we extend the idea beyond the independent-pair 
approximation and approximate the entropy of a given distribution as a sum over the entropies of triplets of 
isolated neurons. We show that this approach leads to Eqs. (jl5p in a substantially simpler way than those 
reported in [5] and Appendix lAl 

In Fig. [3l we show how Dkl{p\ \ Pind), D^LipW Ppair) and A vary with N and 6t for data generated from 
a simulated network. We have also plotted the predictions of the low-rate expansion Eqs. (jlSp and (jl6p . As 
shown in these figures, the low-rate expansion nicely predicts the behaviour of the measurements from the 
simulations particularly for small N and 5t. For 6t = 10 ms, we have 5iq = 0.076 and for 6t = 2 ms, we 
have 62 = 0.019 (note that this gives 610/62 = 3.95, a ratio that would have been equal to 5 if the bins were 
independent). Both the results from the low-rate expansion and those found directly from the simulations 
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show that using finer time bins decrease A for fixed A^. 
4.2 Probability of simultaneous spikes 

As discussed in sec. [2l in addition to entropic measures such as KL divergence and A, which in a sense asks 
how well the model approximates the experimental probabilities of all possible spike patterns, we can restrict 
our quality measure to a subset of possible spike patterns. We can, for instance, ask how well the pairwise 
model approximates the probability of M simultaneous spikes. Similar to the case of A, before getting too 
impressed about the power of pairwise models in approximating the true distributions, we should find out 
what we expect in the case of an arbitrary, or random, true probability distribution. 

Suppose now that we have a distribution over a set of variables of the form of Eq. (|A-ip . For this 
distribution, the probability that a set of M neurons, / = {ii, • • • > H/} out of the whole population of N 
fire in a time bin while the rest do not is 

logpi = 5^ + ^ Jij + J2 ^^i^ + (1^) 

i£l i<j&I i<j<kel 

Averaging over all possible / we get 

q{M,N) ^ (^^) ^logpi = Mn+ )^+ (f )^ + • • • - (18) 

where 7i, J and /C are the means of the fields and pairwise, third-order etc coupling. In Appendix [Aj we 
show that, to leading order in Nb^ the external fields and pairwise couplings of fitted models (independent 
or pairwise) match those of the true model (see Eqs. ()A-7aP and ()A-7bp ). Using this, we see that 



gtrue(M, N) 
qtrne{M, N) 



qind{M,N) r 
qp^^,{M, N) 



{JN){M/NY + {JCN^){M/Ny + 0{{M/Ny 
(1CN^){M/Nf + OHM/NY). 



(19a) 
(19b) 



To have well defined behviour in large limit, one should have JN ~ 1 and JCN^ ~ 1. Eq. ([HD 
show that, for M/N <C 1, both the independent and pairwise models are close to the true distribution. 
For M/N ~ 0(1) (of course still M/N < 1), the difference between the model and true probabilities of 
observing M synchronous spike increases. For all ranges of M, the difference is larger for the independent 
model. These predictions are consistent with the experimental results found in the retina |12j . 

In the above calculation, we compute the difference between the true and model values of q the mean 
of the log probability of M synchronous spikes. However, one can ask about the log mean probability of M 
synchronous spikes, i.e. 



w{M, N) 



log 



N 
M 



exp 



^ Tii + ^ Jij + ^ ICijk + 

i&I i<j&I i<j<kel 



logZ. 



(20) 



Caculating w is in theory very hard, because the second term above involves calculating the averages of 
exponential functions of variables. However, if the population is homogenous enough, such that the average 
couplings and fields from one sample of M neurons to the next does not change much, we can approximate 
the average of the exponential of a variable with the exponential of the average it. Doing this, will again 
lead to Eq. ()19p . The difference between the two measure w and q will most likely appear for small M where 
the averages of the fields and couplings depends on the sample of M neurons more strongly than when M 
is large. 
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5 Extensions of the binary pairwise model 



In the previous sections, we described various approximate methods for fitting a pairwise model of the type 
of Eq. Q. We also studied how good a model it will be for spike trains, using analytical calculations and 
computer simulations. As we describe below, there are two issues with a model of the type of Eq. ([3|) that 
lead to new directions for extending the pairwise models studied here. 

The first issue is the use of binary variables as a representation of the states of the system. For fine 
time bins and neural spike trains, the binary representation serves its purpose very well. However, in many 
other systems, a binary representation will be a naive simplification. Examples of such systems are modular 
models the cortex in which the state of each cortical module is describe by a variable taking a number of 
states usually much larger than 2. In the subsection 15. II we briefly describe a simple non-binary model useful 
for modelling the statistics of such systems. 

The second issue is that by using Eq. ([3]) in cortical networks, one is essentially approximating the 
statistics of a highly non-equilbium system with asymmetric physical interactions, e.g. a balanced cortical 
network, by an equilibrium distribution with symmetric interactions. This manifests itself in a lack of a 
simple relationship between the functional connectives to real physical connections. In our simulations we 
observed that there was no obvious relation between the synaptic connectivity and the inferred functional 
connections. Second, as we showed here and in our previous work, for large populations, the model quality 
decays. Although one can avoid this decay by decreasing 6t as grows, eventually one will get into the 
regime of very fine 6t, where the assumption of independent bins used to build the model does not hold any 
more and one should start including the state transitions in the spike patterns [5]. In fact, Tang et al [4] 
showed that even in the cases that the pairwise distribution of Eq. (I3|) is a good model for predicting the 
distribution of spike patterns, it will not be a good one for predicting the transition probabilities between 
them. These observations encourage one to go beyond an equilibrium distribution with symmetric weights. 
In the second extension, described in subsection 15.21 we propose one such model, although a detailed study 
of the properties of such model is beyond the scope of this paper. 



5.1 Extension to non-binary variables 

The binary representation is probably a good one for spike trains binned into fine time bins. However, for 
larger time bins where there is a considerable probability of observing more than one spike in a bin, as well 
as for a number of other systems, the binary representation may only serve as a naive simplification and 
going to non-binary representations is warranted. Example of such systems include the protein chains and 
modular cortical models. In probabilistic models of protein chains, each site is represented by a non-binary 
variable that takes one of its possible q states depending on the amino acid that sits on that site. In a number 
of models for the operations of cortical networks, one considers a network of interconnected modules, the 
state of each of which is represented by a non-binary variable |19( [20] . Each state of one such variable 
correspond to e.g. one of the many memory states stored in the corresponding module. 

For a set of non-binary variables a = {ai, £72, (Tat), cTj = 1 ... g, one can simply write down a maximum 
entropy pairwise Gibss distribution as 



p{a) 



1 

Z 



i oi i<j a, 13 

Sla,, 



(21a) 
(21b) 



where a and /3 go from 1 . . . q and index the q possible states of each variable. For q = 2, the above 
distribution reduces to the binary case with boolean variables, and when one forces J'^^ = for a ^ (3 one 
recovers the q — state Potts model. Similar to the binary case, here also one is given the experimentally 
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observed values of (liao-i )data and {Uf^fj^upcFj)Aa,t& and wants to infer the fields and couplings that consistent 
with them. 

The approximate methods described in this paper can be adopted, with some effort, to the case of 
non-binary variable as well. In particular, it is easy to derive the difference between the entropy of the 
true distribution, the pairwise model and the independent model in the low-rate limit of the non-binary 
model. Assuming that (iiQCTi)data = C(e) for a 7^ 1, the low-rate regime in the case of non-binary variables 
is characterised byA^(g — l)e<Cl. In this regime, similar to the binary case, 5pair — S'truc oc {N{q — l)e)^ 
and 5ind — ^true {N{q — l)e)^ and A = N{q — l)e, and consequently the pairwise model performs very 
well in the low-rate regime. 

As we described before, the low-rate regime is where most experimental studies on binary pairwise models 
were performed, and the result of the low-rate expansion of the entropies explains the reported success of 
binary pairwise models in those studies. On the other hand, the low-rate regime of the non-binary variable 
may be of little use. This is because the systems to which the non-binary representation should be applied 
are unlikely to fall into the low-rate regime. For instance, in the case of modular memory networks the 
low-rate regime would be the case in which the network spends a significantly larger time in its ground state 
(no memory retrieved) compared to the time it spends operating and retrieving memory. A more likely 
scenario is the one where all the states (memory or no-memory) have approximately similar probabilities of 
occurrence over a period of time. How useful pairwise models are in describing the statistics of such non- 
binary systems away from the trivial regime of the low-rate expansion is not known. Studying the quality 
of non-pairwise models in these cases and developing efficient ways to fit such models, in particular based 
on extensions of the powerful approximations such TAP and SM to non-binary variables will be the focus 
of future work. In particular, it is important to note that writing the SM approximation for the non-binary 
case will be a straightforward task in light of the relation we described between the SM, nMFT and IP 
approximations in sec. 13.41 

5.2 Extension to dynamics and asymmetric interactions 

The Glauber model [21] is the simplest dynamical model that has a stationary distribution equal to the 
Ising model distribution Eq. ([3|). It is defined by a simple stochastic dynamics in which at each timestep 
5t = tq/N one spin is chosen randomly and updated, taking the value -|-1 (i.e., the neuron spikes) with 
probability 



Although the interactions Jij in the static Ising model are symmetric (any antisymmetric piece would cancel 
in computing the distribution ([3])), a Glauber model with asymmetric Jij is perfectly possible |22l I23j . 

This kinetic Ising model is closely related to another class of recently-studied network models called 
generalized linear models (GLMs) \24: \ \25 \ [26] . In a GLM, neurons receive a net input from other neurons of 
the linear form. 



and spike with a probability per unit time equal to a function / of this input. Maximum- likelihood techniques 
have been developed for solving the inverse problem for GLMs (finding the linear kernels that give the 
stimulus input and that from the other neurons in the network, given spike train data) |24[ [25] . They have 
been applied successfully to analyzing spatiotemporal correlations in populations of retinal ganglion cells 



1 



(22) 



P+ = 



1 + exp[-2{hi + JijSj)] ' 




(23) 
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The Glauber model looks superficially like a GLM with instantaneous interactions and /(x) equal to a 
logistic sigmoid function l/(l+exp(— 2x)). But there is a difference in the dynamics: spins are not spikes. In 
the Glauber model, a spin/neuron retains its value (+1 or —1) until it is chosen again for updating. Since the 
updating is random, this persistence time is exponentially distributed with mean tq. Thus a Glauber-model 
"spike" has a variable width, and the autocorrelation function of a free spin exhibits exponential decay with 
a time constant of tq. In the GLM, in contrast, a spike is really a spike and the autocorrelation function 
(for constant input) is a delta-function at i = 0. 

The time constant characterizing the kernel Jijir) in a GLM has a similar effect to tq in the Glauber 
model, but a GLM with an exponential kernel is not exactly equivalent to the Glauber model. In the 
GLM, the effect of a presynaptic spike is spread out and delayed in time by the kernel, but once it is felt, 
the postsynaptic firing rate changes immediately. In the Glauber model, the presynaptic "spike" is felt 
instantaneously and without delay, but the firing state of the neuron takes on the order of tq to change in 
response. 

GLMs grew out of a class of single-neuron models called LNP models. The name LNP comes from the 
fact that there is a linear (L) filtering of the inputs, the result of which is fed to a nonlinear (N) function 
that specifies an instantaneous Poisson (P) firing rate. In the earlier studies, the focus was on the sensory 
periphery, where the input was an externally specified "stimulus". An aim of this modeling effort was to 
improve on classical linear receptive field models. Thus, in the usual formulation of a GLM network, one 
writes the total input as a sum of two terms, one linear in the stimulus as in the LNP model and the other 
linear in the spike trains of the other neurons. Of course, one can trivially add a "stimulus" term in a 
Glauber model, so this differences is not an essential one. 

Similarly, interactions with temporal kernels Jijir) can be included straightforwardly in a Glauber model. 
Such a model is equivalent to a GLM in the limit tq — > 0. (One has to multiply the kernels by 1/r while 
taking the limit, because the integrated strength of a "Glauber spike" is proportional to tq, while that of 
an ordinary spike in a GLM is 1.) 

One can derive a learning algorithm for a Glauber model, given its history, in a standard way, by 
maximizing the likelihood of the history. The update rule, which is exact in the same way that Boltzmann 
learning is for the symmetric model, is 

SJij = r]{[si{ti + e) - tanh(6i + ^ JikSsk{ti))]6sj{ti)), (24) 

k 

where 6sj{t) = Sj{t) — (sj), the average is over the times ti at which unit i is updated, and 6j = tanh~^(sj) = 
hi + Y.jJij{sj)- 

One can also get a simple and potentially useful approximate algorithm which requires no iteration by 
expanding the tanh in eqn. ()24p to first order in the Jij (i.e., to first order around the independent-neuron 
model). Then at convergence (5 Jij = 0) we have 

{6si{ti + e)6sj{ti)) = (1 - (s,)2) Jik{6sk{ti)6sj{U)), (25) 

k 

which is a simple linear matrix equation that can be solved for the Jij. 

6 Discussion 

The brain synthesizes higher-level concepts from multiple inputs received, and in many fields of research 
scientists are interested in inferring as simple a description as possible, given potentially vast amounts of 
data. Such processes of learning take a set of average values, correlations, or any other pattern in the data, 
and from these arrive at another representation, which is useful for speed or accuracy of prediction, for 
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compressed storage of the data, as a grounds for decision making, or in any other aspect which improves 
the functionahty or competitiveness of the brain or the researcher. Models and algorithms for learning in 
these contexts have been studied, in great detail, in neuroscience, image processing, and many other fields, 
for quite some time; consider e.g. the introduction of Boltzmann machines more than a quarter of a century 
ago [27j . and the now more than ten- year old monograph on learning in graphical models |28| . 

As one could have expected, there is a trade-off between the complexity of the model to be learned and 
the efficiency of the learning algorithms. Boltzmann machines are, in principle, able to learn very complex 
models, and provably so, but convergence is then quite slow; modern applications of those methods center 
on various restricted Boltzmann machine models, which can be learned faster |29j . 

When we consider very large systems and/or problems where convergence time of the learning is a serious 
concern, then we are primarily interested in those very fast processes of learning which could be nick-named 
"immediate understanding" , or "iterative understanding" . By this we mean that the outcome of learning 
should be read out directly by one or a series of mathematical transformations of the data. The central 
issue is then obviously the accuracy of the learning outcomes. In this paper we have revisited some classical 
and some more recent algorithms of this kind that take their inspiration from statistical physics. The two 
simplest algorithms considered are naive mean-field and independent pair approximation. As we have shown 
for neural data, both of these perform poorly except for small systems (small A^), of for systems which have 
little variability (small value of N5), where 6 here can be thought of as a proxy for the deviation from a 
uniform state). On the other hand, one generalization of each method, the TAP approximation for the naive 
mean-field and the recent Sessak-Monasson approximation for the independent pair approximation, perform 
much better. We believe that there is scope for further ideas, and note particularly the recent reported 
generalization of the TAP equations to learning within a Bethe-lattice approximation; this approach, so far 
only carried out for pairwise binary models, could be a promising avenue for learning more generally in large 
systems where an underlying connectivity is locally tree-like [30]. We also studied the quality of pairwise 
models using a variety of methods. We derived mathematical expressions relating the quality of pairwise 
models to the size of the population and the lower order statistics of the spike trains employing a number 
of approximation schemes. 

To conclude, we have here framed the presentation in terms of inferring representations of neural data 
and assessing the goodness of the model. Although our focus was on neural data analysis, it is important 
to note that similar problems appear in many other fields of modern biology, for instance, e.g. in network 
reconstruction of gene regulatory networks, in genomic assembly in metagenomics projects, and in many 
other problems. Given the present explosion in sequencing technologies it is conceivable that the more novel 
applications will soon appear outside neuroscience. 



Appendices 

A The partition function, entropy and moments of a Gibbs distribution 
in the hmit N6 

Suppose we have a true distribution of the following form 



In this Appendix, we first find the relation between the the external fields and pairwise and third order 
couplings. As we show below, this allows us to compute the probability of synchronous spikes. We also 





i<j 



i<j<k 
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compute the entropy of this distribution, in the small spike probability regime to derive the expression in 
Eqs. (jlSp . This task can be accomplished much more easily if we rewrite the distribution of Eq. ()A-ip in 
terms of zero-one variables rj, instead of the the spin variables Sj, i.e. 



p(r) = ^ exp 

n = (i + Si)/2, 



/3 ( ^ T-iiri + ^ JijriTj + ^ JCijkrirjVk + ... 

i i<j i<j<k 



(A-2a) 
(A-2b) 



where the auxiliary inverse temperature (3 is introduced because it will allow us to compute the entropy, as 
will become clear later. 

The crucial step in computing the relation between the moments, parameters and entropy of a distri- 
bution is computing its partition function. To compute the partition function of Eq. ()A-2a|l . Z, we note 
that 



= ^ - 1 



exp 



oo ^ 

E- 



n=l 



P JijriTj + ^ PKLijkriTjVk + 

i<j i<j<k 



P JijriTj + ^ PlCijkriTjrk + ... 

i<j i<j<k 



(A-3) 



where ()o indicates averaging with respect to the distribution 



Po{r) = ^exp 



(A-4) 



Note that pQ is not the independent model for p in Eq. ()A-2p but only the part of this distribution that 
includes the fields. In fact, as we show in the following (Eq. ()A-7ap ) the fields of the independent model to 
p only match 7ii to 0{N5) corrections, i.e. Tii = TCf^'^ + 0{NS). When dealing with corrections to the fields 
and couplings, this note will be important. 

Since (r")o = (rj)o = 5j, a term with / distinct indices in the expansion of the term inside the average in 
Eq. (lA^ is of the order of 5'. Therefore, to 0{{N5f), we have 



i<j<k ni,n2 



•ni+n2 



ni\n2\ 



i<j<k ni,n2,n3 



fin nni+n2 

+ E E ^^5^^^^.^^ + E E "'^^ ^ '""^ ' '""^ ' '""^ 



i<j<k n=l 



i<j<kni,n2 



^ , ..^ , ani+n2+n3 

_^ ti T<^m \ 7*^2 7^3 7-712 7^3 _L 7"2 7^31 



i<j<k ni,n2,n3 
i<j<kni,n2,n3,n4, 



ni+n2+n3+n4 



ni\n2\n'i\n/i\ 



(A-5) 
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where the sums over n, 711,712,^3 and 714 run from 1 to infinity. Performing these sums yields 

r = '^(I)ij6i6j+ [4>ij(pik + (Pij'Pjk + (l>ik(pjk + 4'ij4>ik(pjk]Si6j6k 

i<j i<j<k 

+ Yl (l^ijk{l + (l>ij){l + M{l + (l)jk)6iSj5k + 0{{N5)^) (A-6) 

i<j<k 

where (pij = exp{(3Jij) — 1 and cpijk = exp(/3/Cjjfc) — 1. Prom Eq. ()A-6|) . we can immediately compute 
the relation between the means, pairwise and three-point correlation functions and the parameters of the 
distribution. Por /5 = 1, we have 



, diogz aiog(i + r) 

mi = {r-i) = — = + 



Si 



(A-7a) 



C,;. 



(rirj) 



Cijk = {^i^j1~k) 



dlogZ _ 51og(l + r) _ 
dlogZ 51og(l + r) 



exp{Jij)6i5j 



1 + Yl ['^^'^ + '^J'^ + <Pik<Pjk]Sk + 0{{N6f 



exp{ICijk) exp{Jij + Jik + Jjk)5i6j5k + 0{N5^ 



(A-7b) 
(A-7c) 



dKijk dICijk 

The relations between means and pairwise correlations and the external fields and pairwise couplings in 
Eq. ()A-7p a and b to their leading orders were reported previously in [5], using a slightly different approach. 
However, the corrections and three-neuron correlations were not computed there. 

An interesting result of this calculation is a relation between the three-neuron correlations for the pairwise 
distribution, i.e. when ICijk = 0, and the lower moments 



^pair 



(rirjrk) 



pair 



(A-e 



mimjmk 

The fact that, to leading order in N5, the external fields and couplings are determined by means and pairwise 
correlations allows us to compute the leading-order probabilities of synchronous spikes reported as we did 
in sec. 14.21 

We can now use Eq. ()A-6P to find the entropies of the distribution in Eq. ()A-2aP 
dlogZ 



S = log(Z) 



dp 



log(Zo) + log(l + P) - X] ^*(^^) - Yl - Yl ^ijkinrjrk). (A-9) 



Prom Eqs. (|A-7p . for an independent fit to the distribution Eq. (jA-2ap . we have TCf^'^ = Tii- Consequently, 
Sind-S = -T + Y, JiMr3)+ ^^3k{nrjrk) + 0{{N6f) 

i<j i<j<k 

= -Y(l',,mimj + Yjij{nr,) + 0{{N6)f. (A-10) 

i<j i<j 

Using the fact that rhi = 6i + 0{N5'^) from Eq. ()A-7ap . and (l)ij 
from Eq. (|A-7bp . we get Eq. (|15ap with gind defined as 



g-md = 



Pij 



N{N 
fhifhj 



nii rrij 



expiJij) - 1 = C^j/{mimj) - 1 + 0{N6) 

(A-lla) 
(A-llb) 
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For a pairwise model to the distribution in Eq. (|A-2ap . we have = 7^^ -)_ 0{{N5)'^) and J^^j^^^ 



J^j + 0{{NSf) and thus 

5'pair ~ -S" = - 0iifc(l + <^ij)(l + 4>ij)i'i^ + (pij)6i5j5j + ^ ICijkinrjrk) 

i<j<k i<j<k 

Using Eq. (|A^ and Eq. (jA^ . this will lead to Eq. (fTSaj) with 

fi'pair defined as 



(A-12) 
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(A- 13a) 
(A-13b) 



B The independent-pair approximation for the external fields in the 
limit ^ ^ 



Replacing 6i = exp(/ij)/(l + exp(/ij)) in Eq. (IA-7ah and solving to find hi we get 



-Hi 



log 
log 



{ri} 

1 - in) 



^cl>,,6,+0{{N6f 
Y^^inl^MM + 0{{N6)% 



in) 



(B-l) 



where in the second line we have used the fact that 



= iglily - 1 + 0{N6) from Eq. fATbl) and that 
(rj) = 6i + 0{N6). Changing the variables from = 0, 1 to the original spin variables Si = ±1, we have 



2^4 2 



l + nii 
1 - rrii 



(B-2) 



For the IP approximation, on the other hand, we have 



K = Jij + \ log 



1 + rrij 
1 - mi 



Ci 



4(1 + mi) 



(B-3) 



Summing /i^ over j and subtracting the over counted terms from single spin contributions gives the same 
expression as Eq. ()B-2p . Therefore, the IP approximation to the external fields get the leading term and 
the first order corrections of the low-rate approximation correctly. 



C Independent-triplet (IT) approximation 

Considering the following Gibbs distribution over three boolean variables i, j, and k, 
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we can use the definition of the means and correlations and write 
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Inverting the matrix of coefficients we can express the probabilities in terms of the means, pairwise and 
third order correlations. Using the result in 



rjk 



log 



log 



Pijkjl, 0,0) 

Pijk{0,0,0) 
Pijk{0,0,0)pijk{l, 1,0) 
Pijk{hO,0)pijk{0,l,0) 



(C-3a) 
(C-3b) 



and similar equations for the other fields and couplings, we can also express these parameters in terms of 
means, pairwise and third order correlations. Before using Eqs. ()C-3P as approximations for the parameters 
of a pairwise model, we need to perform two more steps. 

The first step is a familiar one that we noted when dealing with the external fields in the independent-pair 
approximation, namely the fact that Tii'^ depends on j and k in addition to i and that IC^j depends on k. We 
can use the same logic that we used in building an independent-pair approximation to the external fields, 
and build an approximation to the external fields that does not depended on j and k, and an approximation 
to the couplings that does not depend on k. For example, for the case of the pairwise couplings we get 



(1 - Cijk/Cij)i'i~ - jrhk - Cik - Cjk + C'ijfc)/(1 - rhj- rhj - Cjj)) 
(1 - {Cik - Cijk) /{rhi - Cij)){l - {Cjk - Cijk) /{rhj - Cij)) 



(C-4) 



The second step has to do with the fact that Eqs. ()C-3p (as well as their transformed version after 
performing the first step, e.g. Eq. IC-4p depend on the third order correlations in addition to the pairwise 
correlations and the means. Hence to derive an expression that relates model parameters to pairwise 
correlations and means we should first find the third order correlation in terms of them. Note that this step 
is not present in the independent-pair approximation. To express the third order correlations in terms of 
the lower order statistics we take advantage of the following equation 

p,jk{0, 0, l)pijk{0, 1, 0)pi,k{l,0, 0)p,,k{l, 1, 1) = P^jk{0, 0, 0)pijk{0, 1, l)pijk{l, 0, l)pijk{l, 1, 0). (C-5) 

Writing the probabilities in terms of the moments, this equation can be solved to find the third order 
correlations in terms of the means and pairwise correlations. The equations have two imaginary solutions 
for Cijk, which are unphysical, and one real solution, which is the correct solution to be considered. The 
resulting expression for Cijk in terms of the means and pairwise correlations is complicated. However, in the 
limit of rhi, rhj-,rhk ~^ 0, it can be shown to have the same form as the one reported in Eq. (|A-8p . We noted 
in the text that in this limit, the IP approximations to the couplings will give the same result as the leading 
order term of Eq. (lA-7bp for the couplings. With the independent-triplet approximation, we can go one step 
further, and as can be shown by doing a small amount of algebra, we can recover 0{N6) corrections to the 
couplings in that we calculated in Eq. (IA-7bp . 
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As mentioned in sec. 13.21 one can continue the above process to build approximations based on quadruples 
of spins and so on. However, this soon becomes difficult in practice for the reason that solving equations of the 
type Eq. (jC-5p to find the higher moments in terms of the means and pairwise correlations will be as difficult 
as the original problem of finding the external fields and couplings of the original body problem in terms 
of the means and correlations. Nevertheless, this simple triplet expansion offer an alternative derivation of 
Eqs. (jlSp . simpler than the derivations in [5] and Appendix[Al Here we show this for Eq. ()15bp . as deriving 
Eq. (jlSah will be similar but less invoved.To derive Eq. (115bp using the triplet expansion, we approximate the 
entropy of the whole system of A^ neurons as a sum of the entropies of all triplets denoted by Sijk- We then 
expand the resulting expression keeping terms of up to 0{6^) noting that Cijk ~ 0{6^), and Cij '-^ 0{6'^). 
The result takes the form of 

= ^ Sijk = - ^ ^ Pijkisi,Sj,Sk)log{pijkisi,Sj,Sk)) 

i<j<k i<j<kSi,Sj,Sh 

= ~ ^ ^ Q{Cijk) ~ Cijk 
i<j<k 

- ^ Cijk - ^og{Cij) - log{Cjk) - log(C'ifc) + log(mi) + log(mj) + log(mfc) 

i<j<k 

- X] - Cij - Cjk) + Q{mi - Cij - Cjk) + Q{rhi - Cij - Cjk) 

i<j<k 

- Yl Qic^j) + QiCik) + QiCjk) 

i<j<k 

- X] -r^i- fiij - mfc + Cij + Cik + Cjk) (C-6) 

i<j<k 

where Q{x) = xlog(a;). For a pairwise model, the independent-triplet approximation to the entropy in 
the limit m — has the same form, except that Cijk of the true model should be replaced by C^^l^ = 

{CijCik)Cjk){rhimjmk)~^ (see Eq. (|A-8P ). i. e. 



•Jpair — ^ijk 

- ^ Q{mi - Cij - Cjk) + Q{rhi - Cij - Cjk) + Qim - Cij - Cjk) 

i<j<k 

- Yl QiCij) + QiQk) + Q{c,k) 

i<j<k 

- Y -rn-i- rhj - rhk + Cij + Qk + Cjk) (C-7) 

i<j<k 

Using Eqs. (fCT^ . (fOTP and (lA^ yields Eq. (pb]) . 

D Derivation of TAP equations from Belief Propagation 

In this appendix we derive the TAP equations (Eq. (jl2p ) starting from the Belief Propagation update rules. 
Let us begin with the result to be established. The TAP equations are a set of nonlinear equations for the 
magnetizations m^, which we will write: 

tanh"^ mi = hi + Y {^'hjmj - e^jfjmi{l - lUj)) + 0{e^). (D-1) 
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Here hi is the external field acting on spin i, eJij are the pairwise couplings, and the notation^' G di means 
that the sum is over neurons j connected with neuron i. As we also did in the text in Eq. (jl2p . the above 
equation is generally quoted with e set to one and without the error term of cubic order in e. 

Starting from the pairwise distribution Eq. ([3]), we define the following distribution, with the auxiliary 
variable e that we set to 1 in the end of our calculation 



P = — exp 



i i<j 



and the exact marginal distribution over spin Si is defined by 

s\si 

where the sum goes over all spins except Sj. The exact magnetization of spin Si is then 

Es Si exp . hiSi + e ^ JijSiSj 
rrii = 



(D-2) 



(D-3) 



Es exp Ei ^iSi + e J2i<j JijSiS, 



(D-4) 



Belief Propagation is a family of methods for approximately computing the marginal distributions of 
probability distributions |3H [321 [T7] . Since the model defined by Eq. (|D-2p contains only pairwise interac- 
tions, it is convenient to adopt the pairwise Markovian Random Field formalism of Yedidia, Freeman and 
Weiss [32]. Note that the important recent contribution by Mezard and Mora uses a more general formalism, 
which may prove to be more convenient in the perspective of extending an "inverse BP" learning algorithm 
beyond pairwise models |30| . 

Belief Propagation applied to the model in Eq. ()D-2p in the Yedidia- Weiss-Freeman formalism is built 
on probability distributions, called BP messages, associated with every directed link in the graph. If i ^ j 
is such a link, starting at i and ending at j, then the BP message r]i^j{sj) is a probability distribution on 
the variable sj associated to node where the link ends. For Ising spins one can use the parametrization 



1 + Sjmi^j 



(D-5) 



where rrii^j is a real number called the cavity magnetization. BP is characterized by two equations, the 
Belief Propagation update equations, and the Belief Propagation output equations. The BP update equations 
are used iteratively to find a fixed point, which is an extremum of the Bethe free energy. At the fixed point, 
the BP update equations form a (large) set of compatibility conditions for the r/'s which, for the model 
Eq. (jDSD, read 



Vj- 



n ^fc-i 

k&dj\i 



is,). 



The BP output equations determine the marginal probability distributions from the r/'s and read 



(D-6) 



(D-7) 



where r2j_»j and Jlj in Eqs. ()D-6p and (jD-7p are normalizations. To lighten the notation, we will not 
distinguish between the exact marginals, as in Eq. ()D-3p . and the approximate marginals from BP, as in 
Eq. dEZl). 
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We now write the BP update and BP output equations using the cavity magnetizations, rui^j, from 
Eq. ()D-5p . The notation simphfies if one define an ancillary quantity 

1 1 + rrii^j 1 

(7,_>i = - log = tanh rrii^j (D-8) 

Ht^3 2 ^ 1 - mi^j ^ ^ ' 

in terms of which the BP update equation can be written as 

tanh(e Jjj) tanh(/ij + Qk^j) (D-9) 



m 



and the BP output equation as 

rrii = tanh(/ij + Qj^i)- (D-10) 

The task is now to expand the right hand side of Eq. ()D-10p in e and compare with the TAP equations, 
Eqs. (ID-lh . To do this, we first note that 

qj^i = eJij tanh(/ij + ^ qk^j) + C(e^) = eJijUij + O(e^). (D-11) 

This follows from expanding Eq. ()D-9p in e, using the result in Eq. ()D-8p . and finally, expanding the 
logarithm. We then rewrite the BP output equation ()D-10p as 

tanh"^ TTij = /ij + e ^ Jij tanh(/ij - qi^j + ^ qk-*j) + 0{e^). (D-12) 

Since nij = tanh(/ij + "^^ed 1k-^j) (i^o expansion in e) we want to separate qi^j and hj + Ylked Ik^j 

arguments of the tanh'es in Eq. (ID-12p . and according to Eq. (|D-lip . qi^j is of order e. This means that 
we can write, to order e, 

tanh(/ij - qi^j + ^ qk^j) = mj - eJijmi{l - ttij). (D-13) 

kedj 

Introducing this into Eq. ()D-12p . we finally have 

tanh"^ mi = hi + ^{eJijmj - e^J^jnii^l - mj)) + 0{e^) (D-14) 

which was to be proved. 
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